Abstract

Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor $\lambda = \Omega(\lg n \lg\lg n)$ on a fat-tree with $n$ processors, the number of delivery cycles (routing attempts) that the algorithm requires is $O(\lambda)$ with probability $1 - O(1/n)$. The best previous bound was $O(\lambda \lg n)$ for the off-line problem where switch settings can be determined in advance. In a VLSI-like model where hardware cost is equated with physical volume, we use the routing algorithm to demonstrate that fat-trees are universal routing networks in the sense that any routing network can be efficiently simulated by a fat-tree of comparable hardware cost.

1 Introduction

Fat-trees constitute a class of routing networks for general-purpose parallel computation. This paper presents a randomized algorithm for routing a set of messages on a fat-tree. The routing algorithm and its analysis generalize an earlier universality result by showing, in a three-dimensional VLSI model, that for a given volume of hardware, a fat-tree is nearly the best routing network that can be built. This universality result had been proved only for off-line simulations [8], where switch settings can be determined in advance; this paper extends it to the more interesting on-line case, where messages are spontaneously generated by processors.

Figure 1: The organization of a fat-tree. Processors are located at the leaves, and the internal nodes contain concentrator switches. The capacities of channels increase as we go up the tree.

As is illustrated in Figure 1, a fat-tree is a routing network based on Leighton's tree-of-meshes graph [7]. A set of $n$ processors are located at the leaves of a complete binary tree. Each edge of the underlying tree corresponds to two channels of the fat-tree: one from parent to child, the other from child to parent. Unlike a normal tree interconnection, which is "skinny all over," each channel of a fat-tree consists of a bundle of wires. The number of wires in a channel $c$ is called its capacity, denoted by $\mathrm{cap}(c)$. Each internal node of the fat-tree contains circuitry that switches messages from incoming to outgoing channels.

The capacities of the channels in a fat-tree determine how much hardware is required to build it, where we measure hardware in terms of three-dimensional volume. The greater the capacities of the channels, the greater the communication potential, and also, the greater the volume. The capacities in a universal fat-tree [8] grow exponentially as we go from leaves to root, where the base of the exponential is at most 2. Section 5 shows that for a given amount of hardware, a universal fat-tree is nearly the best network that can be built.

We shall consider communication through the fat-tree network to be synchronous, bit serial, and batched. By synchronous, we mean that the system is globally clocked. By bit serial, we mean that the messages can be thought of as bit streams. Each message snakes its way through the wires and switches of the fat-tree, with leading bits of the message setting switches and establishing a path for the remainder to follow. By batched, we mean the messages are grouped into delivery cycles. During a delivery cycle, the processors send messages through the network. Each message attempts to establish a path from its source to its destination. Since some messages may be unable to establish connections during a delivery cycle, each successfully delivered message is acknowledged through its communication path at the end of the cycle. Rather than buffering undelivered messages, we simply allow them to try again in a subsequent delivery cycle. The routing algorithm is responsible for grouping the messages into delivery cycles so that all the messages are delivered in as few cycles as possible.

The mechanics of routing messages in a fat-tree are almost as simple as routing in an ordinary tree. For each message, there is a unique path from its source processor to its destination processor in the underlying complete binary tree, which can be specified by a relative address consisting of at most $2 \lg n$ bits telling whether the message turns left or right at each internal node.
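The unique tree path can be sketched in a few lines. This is only an illustration under stated assumptions: leaves are numbered 0 to n-1 from left to right, and the path is described as a count of upward hops followed by left/right turns on the way down. The function name `relative_path` and this particular encoding are hypothetical and are not the address format used by the paper.

```python
def relative_path(src: int, dst: int, depth: int):
    """Describe the tree path between two leaves of a complete binary tree.

    Assumes leaves are numbered 0 .. 2**depth - 1 from left to right.
    Returns (levels_up, down_turns): the number of channels climbed to reach
    the least common ancestor, then one bit per downward channel
    (0 = left child, 1 = right child).
    """
    n = 1 << depth
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("leaf index out of range")
    # The highest bit in which src and dst differ gives the height of
    # their least common ancestor.
    levels_up = (src ^ dst).bit_length()
    # Going down, follow the bits of dst below that height, high bit first.
    down_turns = [(dst >> k) & 1 for k in range(levels_up - 1, -1, -1)]
    return levels_up, down_turns
```

The path crosses `levels_up` upward channels and the same number of downward channels, so it never uses more than 2 lg n channels in total.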

Within each node of the fat-tree, the messages destined for a given output channel are concentrated onto the available wires of that channel. This concentration may result in "lost" messages if the number of messages destined for the output channel exceeds the capacity of the channel. We assume, however, that the concentrators within the node are ideal in the sense that no messages are lost if the number of messages destined for a channel is less than or equal to the capacity of the channel. Such a concentrator can be built, for example, with a log-depth sorting network [1]. A more practical log-depth circuit can be built by combining a parallel prefix circuit [6] with a butterfly (a.k.a. FFT, Omega) network.

The performance of any routing algorithm depends on the locality of communication in a set of messages because some messages may be routed locally within subtrees of the fat-tree without soaking up bandwidth near the root. The locality of communication for a message set $M$ can be summarized by a measure $\lambda(M)$ called the load factor, which we define in a more general network setting.


load factor                                     delivery cycles
------------------------------------------      -----------------------------
$0 \le \lambda(M) \le 1$                        $1$
$1 < \lambda(M) \le 2$                          $O(\lg n)$
$2 < \lambda(M) \le \lg n \lg\lg n$             $O(\lg n \lg \lambda(M))$
$\lg n \lg\lg n < \lambda(M) \le poly(n)$       $O(\lambda(M))$

Figure 2: Number of delivery cycles required to deliver a message set $M$ on a fat-tree with $n$ processors. All bounds are achieved with probability $1 - O(1/n)$. We assume in line 4 that the load factor $\lambda(M)$ is polynomially bounded.


Definition: Let $R$ be a routing network. A set $S$ of wires in $R$ is a (directed) cut if it partitions the network into two sets of processors $A$ and $B$ such that every path from a processor in $A$ to a processor in $B$ contains a wire in $S$. The capacity $\mathrm{cap}(S)$ is the number of wires in the cut. For a set of messages $M$, define the load $\mathrm{load}(M, S)$ of $M$ on a cut $S$ to be the number of messages in $M$ that must cross $S$. The load factor of $M$ on $S$ is

$$\lambda(M, S) = \frac{\mathrm{load}(M, S)}{\mathrm{cap}(S)} ,$$

and the load factor of $M$ on the entire network $R$ is

$$\lambda(M) = \max_{S} \lambda(M, S) .$$


The load factor provides a simple lower bound on the number of delivery cycles required to deliver a set of messages. When the set of messages is known in advance, it has been shown [8] that a set $M$ of messages can be delivered in $O(\lambda(M) \lg n)$ delivery cycles on a fat-tree with $n$ processors. Our routing algorithm, whose running time is summarized in Figure 2, improves this off-line result in two ways. First, the algorithm does not need to know the set of messages in advance, but can deliver them on-line. Second, the bounds on running time generally improve (and always at least match) the previous off-line bound. The only caveat is that our algorithm is randomized instead of being deterministic, but the stated bounds are achieved with high probability.

The analysis in terms of load factor is not restricted to permutation routing or situations where each processor can only send or receive a constant number of messages, as is common in the literature. We consider the general situation where each processor can send and receive polynomially many messages. Furthermore, we make no assumptions about the statistical distribution of messages, except insofar as they affect the load factor.




Our routing algorithm also differs from others in the literature in the way randomization is used. Unlike the algorithms of Valiant [11], Valiant and Brebner [12], Aleliunas [2], Upfal [10] and Pippenger [9], for example, it does not randomize with respect to paths taken by messages. Instead, for each delivery cycle, each undelivered message randomly chooses whether to be sent.

The remainder of this paper is organized as follows. Section 2 describes the randomized algorithm for routing on fat-trees. Section 3 contains some preliminary lemmas needed to analyze the algorithm, and Section 4 contains the full analysis. Section 5 contains a variety of results that follow from the randomized routing algorithm. It shows how the universality result of [8] can be extended to on-line simulations, and it includes a modification of the routing algorithm which achieves better bounds when each channel has capacity $\Omega(\lg n)$. It also gives an existential lower bound for the naive greedy approach to routing messages which shows that the greedy strategy is inferior to the randomized algorithm for worst case inputs. Finally, Section 6 contains some concluding remarks.

2 The routing algorithm

This section gives a randomized algorithm for routing a set $M$ of messages, which is based on routing random subsets of the messages in $M$. The algorithm RANDOM is shown in Figure 3, and it uses the subroutine TRY-GUESS shown in Figure 4. Section 4 will provide a proof that on an $n$-processor fat-tree, the probability is at least $1 - O(1/n)$ that RANDOM delivers all messages in $M$ within the number of delivery cycles specified by Figure 2, if the two constants $k_1$ and $k_2$ appearing in the algorithm are properly chosen.

The basic idea of RANDOM is to pick a random subset of messages to send in each delivery cycle by independently choosing each message with some probability $p$. This idea is sufficiently important to merit a formal definition.

Definition: A $p$-subset of $M$ is a subset of $M$ formed by independently choosing each message of $M$ with probability $p$.

We will show in Section 4 that if $p$ is sufficiently small, a substantial portion of the messages in a $p$-subset are delivered because they encounter no congestion during routing. On the other hand, if $p$ is too small, few messages are sent. RANDOM varies the probability $p$ from cycle to cycle, seeking random subsets of $M$ which contain a substantial portion of the messages in $M$ but which do not cause congestion.
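Forming a p-subset amounts to one independent coin flip per message. The sketch below is illustrative only; the function name `p_subset` and the choice to pass an explicit `random.Random` instance are assumptions made here, not part of the paper.

```python
import random


def p_subset(messages, p, rng):
    """Return a p-subset: keep each message independently with probability p.

    `rng` is a random.Random instance, so runs can be reproduced by seeding.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # rng.random() is uniform on [0, 1), so each message is kept with
    # probability exactly p, independently of the others.
    return [m for m in messages if rng.random() < p]
```

The expected size of the result is p times the number of messages, which is the quantity RANDOM tunes from cycle to cycle.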

 1  send M
 2  U ← M − {messages delivered}
 3  λ_guess ← 2
 4  while k1·λ_guess < k2 lg n and U ≠ ∅ do
 5      TRY-GUESS(λ_guess)
 6      λ_guess ← (λ_guess)^2
 7  endwhile
 8  λ_guess ← (k2/k1) lg n lg lg n
 9  while U ≠ ∅ do
10      TRY-GUESS(λ_guess)
11      λ_guess ← 2·λ_guess
12  endwhile

Figure 3: The randomized algorithm RANDOM for delivering a message set $M$ on a fat-tree with $n$ processors. This algorithm achieves the running times in Figure 2 with high probability if the constants $k_1$ and $k_2$ are appropriately chosen. Since the load factor $\lambda(M)$ is not known in advance, it is necessary to make guesses, each one being tried out by the subroutine TRY-GUESS.

procedure TRY-GUESS(λ_guess)
 1  λ ← λ_guess
 2  while λ > 1 do
 3      for i ← 1 to max{k1·λ, k2 lg n} do
 4          independently send each message of U with probability 1/(rλ)
 5          U ← U − {messages delivered}
 6      endfor
 7      λ ← λ/2
 8  endwhile
 9  send U
10  U ← U − {messages delivered}

Figure 4: The subroutine TRY-GUESS used by the algorithm RANDOM, which tries to deliver the set $U$ of currently undelivered messages. When $\lambda_{\mathrm{guess}} \ge \lambda(U)$, this attempt will be successful with high probability, if the constants $k_1$ and $k_2$ are appropriately chosen. (The value $r$ is the congestion parameter of the fat-tree defined in Section 4, which is typically a small constant.) In that case, $\lambda$ is always an upper bound on $\lambda(U)$, which is at least halved in each iteration of the while loop. When the loop is finished, $\lambda(U) \le 1$, so all the remaining messages can be sent.

The algorithm RANDOM varies the probability $p$ because the load factor $\lambda(M)$ is not known. The overall structure of RANDOM is to guess the load factor and call the subroutine TRY-GUESS for each one. TRY-GUESS determines the probability $p$ based on RANDOM's guess $\lambda_{\mathrm{guess}}$ and a parameter $r$, called the congestion parameter, which will be defined in Section 4. If $\lambda_{\mathrm{guess}}$ is an upper bound on the true load factor $\lambda(M)$, each iteration of the while loop in TRY-GUESS halves the bound $\lambda$ on the load factor of the undelivered messages with high probability, as will be shown in Section 4. When the loop is finished, we have $\lambda(U) \le \lambda \le 1$, and all the remaining messages can be delivered in one cycle. The number of delivery cycles performed by TRY-GUESS is $O(\lg \lambda_{\mathrm{guess}} \lg n)$ if $2 \le \lambda_{\mathrm{guess}} \le O(\lg n)$, and the number of cycles is $O(\lambda_{\mathrm{guess}} + \lg n \lg\lg n)$ if $\lambda_{\mathrm{guess}} = \Omega(\lg n)$.
RANDOM must make judicious guesses for the load factor because TRY-GUESS may not be effective if its guess is smaller than the true load factor. Conversely, if the guess is too large, too many delivery cycles will be performed. Since the amount of work done by TRY-GUESS grows as $\lg \lambda_{\mathrm{guess}}$ for $\lambda_{\mathrm{guess}}$ small, and as $\lambda_{\mathrm{guess}}$ for $\lambda_{\mathrm{guess}}$ large, there are two main phases to RANDOM's guessing. (These phases follow the handling of very small load factors, i.e., $\lambda(M) \le 2$.)

In the first phase, the guesses are squared from one trial to the next. Once $\lambda_{\mathrm{guess}}$ is sufficiently large, we move into the second phase, and the guesses are doubled from one trial to the next. In each phase, the number of delivery cycles run by TRY-GUESS from one call to the next forms a geometric series. Thus, the work done in any call to TRY-GUESS is only a constant factor times all the work done prior to the call. With this guessing strategy, we can deliver a message set using only a constant factor more delivery cycles than would be required if we knew the load factor in advance.
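The following is a small simulation sketch of RANDOM and TRY-GUESS, not the authors' implementation. Several things in it are assumptions made for illustration: leaves are numbered 0 to 2^depth - 1, messages are (source, destination) pairs, every channel at height h has the same integer capacity `cap(h)`, a congested channel drops a uniformly random excess of its messages, and the default values of `r`, `k1` and `k2` are arbitrary placeholders rather than the constants required by the analysis.

```python
import math
import random
from collections import defaultdict


def delivery_cycle(msgs, sent, depth, cap, rng):
    """Return the indices in `sent` that get through one delivery cycle."""
    alive = set(sent)
    for going_up in (True, False):
        # Visit upward channels leaf-to-root, then downward root-to-leaf.
        heights = range(depth) if going_up else reversed(range(depth))
        for h in heights:
            requests = defaultdict(list)
            for i in sorted(alive):
                src, dst = msgs[i]
                if (src ^ dst).bit_length() > h:  # path reaches height h
                    requests[(src if going_up else dst) >> h].append(i)
            for group in requests.values():
                excess = len(group) - cap(h)
                if excess > 0:
                    # Assumption: a congested channel drops a random excess.
                    alive.difference_update(rng.sample(group, excess))
    return alive


def try_guess(msgs, U, guess, depth, cap, r, k1, k2, rng):
    """Sketch of TRY-GUESS; mutates U and returns the cycles used."""
    cycles, lam = 0, float(guess)
    while lam > 1:
        for _ in range(math.ceil(max(k1 * lam, k2 * depth))):
            if not U:  # simulation shortcut, not part of the pseudocode
                return cycles
            sent = [i for i in sorted(U) if rng.random() < 1 / (r * lam)]
            U -= delivery_cycle(msgs, sent, depth, cap, rng)
            cycles += 1
        lam /= 2
    U -= delivery_cycle(msgs, U, depth, cap, rng)
    return cycles + 1


def random_route(msgs, depth, cap, r=2.0, k1=1.0, k2=1.0, seed=0):
    """Sketch of RANDOM; returns the number of delivery cycles used."""
    rng = random.Random(seed)
    U = set(range(len(msgs)))
    U -= delivery_cycle(msgs, U, depth, cap, rng)  # send M
    cycles, guess = 1, 2.0
    while k1 * guess < k2 * depth and U:  # first phase: square the guess
        cycles += try_guess(msgs, U, guess, depth, cap, r, k1, k2, rng)
        guess = guess * guess
    # Guard (an addition of this sketch) keeps the guess at least 2
    # on very small trees, where lg lg n would be 0.
    guess = max(2.0, (k2 / k1) * depth * math.log2(max(depth, 2)))
    while U:  # second phase: double the guess
        cycles += try_guess(msgs, U, guess, depth, cap, r, k1, k2, rng)
        guess *= 2
    return cycles
```

For example, `random_route([(i, 15 - i) for i in range(16)], 4, lambda h: 2 ** h)` routes a reversal permutation on a 16-leaf tree whose capacities double at each level. Because the drop rule lets `cap(h)` messages through every congested channel, each full send delivers at least one message, so the simulation always terminates.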

3 Preliminary lemmas

This section contains three lemmas that will be needed to analyze the algorithm RANDOM from the preceding section. The first lemma relates the definition of load factor given in Section 1 to the channel structure of the fat-tree. The other two are technical lemmas concerning basic probability. One is a combinatorial bound on the tail of the binomial distribution of the kind attributed to Chernoff [4], and the other is a more general, but weaker, bound on the probability that a random variable takes on values smaller than the expectation.

The first lemma states that in a fat-tree, the load factor of a set of messages is determined by the cuts on the channels alone.

Lemma 1 The load factor of a set $M$ of messages on a fat-tree is

$$\lambda(M) = \max_{c} \lambda(M, c) ,$$

where $c$ ranges over all channels of the fat-tree. ∎
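Lemma 1 makes the load factor easy to compute: count the messages crossing each channel and take the largest ratio of load to capacity. The sketch below assumes leaves numbered 0 to n-1, messages given as (source, destination) pairs, and a capacity that depends only on the height of a channel above the leaves. The names `channel_loads` and `load_factor` and the channel keys are hypothetical.

```python
from collections import Counter


def channel_loads(messages):
    """Count the messages that must cross each channel of the tree.

    A channel is keyed as (direction, height, subtree index): the channel
    joining the subtree of the given height to its parent, in the given
    direction.
    """
    loads = Counter()
    for src, dst in messages:
        top = (src ^ dst).bit_length()  # height of the least common ancestor
        for h in range(top):
            loads[("up", h, src >> h)] += 1
            loads[("down", h, dst >> h)] += 1
    return loads


def load_factor(messages, cap):
    """Maximum over channels of load / capacity, with cap(h) per height h."""
    loads = channel_loads(messages)
    return max((count / cap(h) for (_, h, _), count in loads.items()),
               default=0.0)
```

For instance, the two messages (0, 3) and (1, 2) on a four-leaf tree both climb through the same height-1 channel, so the load factor is 2 with unit capacities and 1 when capacities double per level.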

The next lemma is a "Chernoff" bound on the tail of a binomial distribution. Suppose that we have $t$ independent Bernoulli trials, each with probability $p$ of success. It is well known [5] that the probability that there are at least $s$ successes out of the $t$ trials is

$$B(s, t, p) = \sum_{k=s}^{t} \binom{t}{k} p^k (1 - p)^{t-k} .$$

The lemma bounds the probability that the number of successes is larger than the expectation $pt$.

Lemma 2

$$B(s, t, p) \le \left( \frac{ept}{s} \right)^{s} . \qquad ∎$$
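A quick numerical check can make the bound concrete. The sketch below assumes Lemma 2 reads B(s, t, p) <= (ept/s)^s, which is the form the proof of Lemma 4 relies on, and compares it with the exact binomial tail. The function names are chosen here for illustration.

```python
import math


def binomial_tail(s, t, p):
    """Exact probability of at least s successes in t Bernoulli(p) trials."""
    return sum(math.comb(t, k) * p ** k * (1 - p) ** (t - k)
               for k in range(s, t + 1))


def chernoff_bound(s, t, p):
    """The upper bound (e * p * t / s) ** s, for 1 <= s <= t."""
    return (math.e * p * t / s) ** s
```

The bound is only informative when s exceeds e times the expectation pt, since otherwise it is at least 1; in the routing analysis this is arranged by keeping p small.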

The final lemma in this section bounds the probability that a bounded random variable takes on values smaller than the expectation.

Lemma 3 Let $X \le b$ be a random variable with expectation $\mu$. Then for any $w < \mu$, we have

$$\Pr\{X \le w\} \le 1 - \frac{\mu - w}{b - w} . \qquad ∎$$
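One standard way to see a bound of this shape, offered here as a sketch and not necessarily the authors' proof, is to apply Markov's inequality to the nonnegative variable $b - X$. It assumes the lemma has the form $\Pr\{X \le w\} \le 1 - (\mu - w)/(b - w)$, and it uses $b - w > 0$, which holds because $w < \mu \le b$:

```latex
\begin{align*}
\Pr\{X \le w\} &= \Pr\{\, b - X \ge b - w \,\} \\
               &\le \frac{E[\, b - X \,]}{b - w}
                  && \text{(Markov's inequality, since } b - X \ge 0 \text{)} \\
               &= \frac{b - \mu}{b - w}
                = 1 - \frac{\mu - w}{b - w} .
\end{align*}
```

Taking $b = \mathrm{cap}(c)$, $\mu > \mathrm{cap}(c)/4r$ and $w = \mathrm{cap}(c)/8r$ gives the value $1 - 1/(8r - 1)$ that appears in the proof of Lemma 6.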

4 Analysis of the routing algorithm RANDOM

This section contains the analysis of RANDOM, the routing algorithm for fat-trees presented in Section 2. We shall show that the probability is $1 - O(1/n)$ that RANDOM delivers a set $M$ of messages on a universal fat-tree with $n$ processors in the number of delivery cycles given by Figure 2.

We begin by analyzing the routing of a $p$-subset $M'$ of a set $M$ of messages. If the number $\mathrm{load}(M', c)$ of messages in $M'$ that must pass through $c$ is no more than the capacity $\mathrm{cap}(c)$, then no messages will be lost by concentrating the messages into $c$. We shall say that $c$ is congested by $M'$ if $\mathrm{load}(M', c) > \mathrm{cap}(c)$. We now show that the likelihood of channel congestion decreases exponentially with channel capacity if the probability of choosing a given message out of $M$ is sufficiently small.

Lemma 4 Let $M$ be a set of messages on a fat-tree, let $\lambda(M)$ be the load factor on the fat-tree due to $M$, let $M'$ be a $p$-subset of messages from $M$, and let $c$ be a channel through which a given message $m \in M'$ must pass. Then the probability is at most $(ep\lambda(M))^{\mathrm{cap}(c)}$ that channel $c$ is congested by $M'$.
Proof. Channel $c$ is congested by $M'$ if $\mathrm{load}(M', c) > \mathrm{cap}(c)$. There is already one message from the set $M'$ going through channel $c$, so we must determine a bound on the probability that at least $\mathrm{cap}(c)$ other messages go through $c$. Using Lemma 2 with $s = \mathrm{cap}(c)$ and $t = \mathrm{load}(M, c)$, the probability that the number of messages sent through channel $c$ is greater than the capacity $\mathrm{cap}(c)$ is less than

$$B(\mathrm{cap}(c), \mathrm{load}(M, c), p) \le \left( \frac{ep\,\mathrm{load}(M, c)}{\mathrm{cap}(c)} \right)^{\mathrm{cap}(c)} \le (ep\lambda(M))^{\mathrm{cap}(c)} . \qquad ∎$$

The next lemma will analyze the probability that a given message of a $p$-subset of $M$ gets delivered. In order to do the analysis, however, we must select $p$ small enough so that it is likely that the message passes exclusively through uncongested channels. The choice of $p$ depends on the capacities of channels in the fat-tree. For convenience, we define one parameter of the capacities which will enable us to choose a suitable upper bound for $p$.

Definition: The congestion parameter $r$ of a fat-tree is the smallest positive value such that for each simple path $c_1, c_2, \ldots, c_l$ of channels in the fat-tree, we have

$$\sum_{k=1}^{l} \left( \frac{e}{r} \right)^{\mathrm{cap}(c_k)} \le \frac{1}{2} .$$

For a fat-tree based on a complete binary tree, the longest simple path is at most $2 \lg n$, where $n$ is the number of processors, and thus $r \le 4e \lg n$. For universal fat-trees, the congestion parameter is a constant because the capacities of channels grow exponentially as we go up the tree. (All we really need is arithmetic growth in the channel capacities.) The congestion parameter is also constant for any fat-tree based on a complete binary tree if all the channels have capacity $\Omega(\lg\lg n)$. The remaining analysis treats the congestion parameter $r$ as a constant, but the analysis does not change substantially for other cases.
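The congestion parameter can be estimated numerically for a given capacity profile. The sketch below is illustrative only. It assumes the defining condition is that the sum of (e/r)^cap(c_k) over any simple path is at most 1/2, that all channels at the same height share one capacity of at least 1, and that the worst path is therefore a leaf-to-leaf path through the root, which crosses every height once in each direction. The names `path_sum` and `congestion_parameter` are hypothetical.

```python
import math


def path_sum(caps, r):
    """Sum of (e/r)**cap over a leaf-to-root-to-leaf path.

    caps[h] is the capacity of the channels at height h; each height is
    crossed once going up and once going down.
    """
    return 2 * sum((math.e / r) ** c for c in caps)


def congestion_parameter(caps, tol=1e-9):
    """Smallest r with path_sum(caps, r) <= 1/2, found by bisection."""
    if not caps or min(caps) < 1:
        raise ValueError("need at least one level and capacities >= 1")
    # r = e is always too small; r = 4e * (number of levels) always works.
    lo, hi = math.e, 4 * math.e * len(caps)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if path_sum(caps, mid) <= 0.5:
            hi = mid
        else:
            lo = mid
    return hi
```

With unit capacities on a 16-leaf tree the result is 16e, matching the bound of 4e lg n, while capacities that double at each level give a value of roughly 13 regardless of the number of levels.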

We now present the lemma that analyzes the probability that a given message gets delivered.

Lemma 5 Let $M$ be a set of messages on a fat-tree with congestion parameter $r$, let $\lambda(M)$ be the load factor on the fat-tree due to $M$, and let $m$ be an arbitrary message in $M$. Suppose $M'$ is a $p$-subset of $M$, where $p \le 1/r\lambda(M)$. Then if $M'$ is sent, the probability that $m$ gets delivered is at least $\frac{1}{2}p$.

Proof. The probability that $m \in M$ is delivered is at least the probability that $m \in M'$ times the probability that $m$ passes exclusively through uncongested channels. The probability that $m \in M'$ is $p$, and thus we need only show that, given $m \in M'$, the probability is at least $\frac{1}{2}$ that every channel through which $m$ must pass is uncongested. Let $c_1, c_2, \ldots, c_l$ be the channels in the fat-tree through which $m$ must pass. The probability that channel $c_k$ is congested is at most $(ep\lambda(M))^{\mathrm{cap}(c_k)} \le (e/r)^{\mathrm{cap}(c_k)}$ by Lemma 4 and the assumption $p \le 1/r\lambda(M)$. The probability that at least one of the channels is congested is, therefore, at most

$$\sum_{k=1}^{l} \left( \frac{e}{r} \right)^{\mathrm{cap}(c_k)} \le \frac{1}{2}$$

by definition of the congestion parameter. Thus, the probability that none of the channels are congested is at least $\frac{1}{2}$. ∎

We now focus our attention on RANDOM itself. The next lemma analyzes the innermost loop (lines 3-6) of RANDOM's subroutine TRY-GUESS. At this point in the algorithm, there is a set $U$ of undelivered messages and a value for $\lambda$. The lemma shows that if $\lambda$ is indeed an upper bound on the load factor $\lambda(U)$ of the undelivered messages when the loop begins, then $\lambda/2$ is an upper bound after the loop terminates. This lemma is the crucial step in showing that RANDOM works.


Lemma 6 Let $U$ be a set of messages on an $n$-processor fat-tree with congestion parameter $r$, and assume $\lambda(U) \le \lambda$. Then after lines 3-6 of RANDOM's subroutine TRY-GUESS, the probability is at most $O(1/n^2)$ that $\lambda(U) > \frac{1}{2}\lambda$.

Proof. The idea of the proof is to show that the load factor of an arbitrary channel $c$ remains larger than $\frac{1}{2}\lambda$ with probability $O(1/n^3)$. Since the channel $c$ is chosen arbitrarily out of the $4n - 2$ channels in the fat-tree, the probability is at most $O(1/n^2)$ that any of the channels is left with load factor larger than $\frac{1}{2}\lambda$.

For convenience, let $C$ be the subset of messages that must pass through channel $c$ and are undelivered at the beginning of the innermost loop in RANDOM. Let $C_0 = C$, and for $i \ge 1$, let $C_i \subseteq C_{i-1}$ denote the set of undelivered messages at the end of the $i$th iteration of the loop. Notice that $\lambda(C_i, c) = |C_i| / \mathrm{cap}(c)$, since $|C_i| = \mathrm{load}(C_i, c)$.

We now show there exist values for the constants $k_1$ and $k_2$ in line 3 of TRY-GUESS such that for $z = \max\{k_1\lambda, k_2 \lg n\}$, the probability is $O(1/n^3)$ that $\lambda(C_z, c) > \frac{1}{2}\lambda$, or equivalently, that

$$|C_z| > \tfrac{1}{2}\lambda\,\mathrm{cap}(c) . \qquad (1)$$

It suffices to prove that the probability is $O(1/n^3)$ that fewer than $\frac{1}{2}|C|$ messages from $C$ are delivered during the $z$ cycles under the assumption that $|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$ for $i = 0, 1, \ldots, z - 1$. The intuition behind the assumption $|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$ is that otherwise, the load factor on channel $c$ is already at most $\frac{1}{2}\lambda$ at this step of the iteration. The reason we need only bound the probability that fewer than $\frac{1}{2}|C|$ messages are delivered during the $z$ cycles is that inequality (1) implies that the number of messages delivered is fewer than $|C| - \frac{1}{2}\lambda\,\mathrm{cap}(c) \le |C| - \frac{1}{2}\lambda(C, c)\,\mathrm{cap}(c) = \frac{1}{2}|C|$.

We shall establish the $O(1/n^3)$ bound on the probability that at most $\frac{1}{2}|C|$ messages are delivered in two steps. For convenience, we shall call a cycle good if at least $\mathrm{cap}(c)/8r$ messages are delivered, and bad otherwise. In the first step, we bound the probability that a given cycle is bad. Using Lemma 5 with $p = 1/r\lambda \le 1/r\lambda(U) \le 1/r\lambda(C_i)$ in conjunction with the assumption that $|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$, we can conclude that the expected number of messages delivered in any given cycle is greater than $\frac{1}{2} \cdot \frac{1}{r\lambda} \cdot \frac{1}{2}\lambda\,\mathrm{cap}(c) = \mathrm{cap}(c)/4r$. Then by Lemma 3, the probability that a given cycle is bad is at most $1 - 1/(8r - 1) < 1 - 1/8r$. (Although this bound is sufficiently strong to prove our theoretical results, it is weak because the probability that a message is delivered in a given cycle is not independent from the probabilities for other messages, and thus we must rely on the bound given by Lemma 3. In practice, one would anticipate that the dependencies are weak, and that the algorithm would be effective with much smaller values for the constants $k_1$ and $k_2$ than we can prove here.)

The second step bounds the probability that a substantial fraction of the $z$ delivery cycles are bad. Specifically, we show that the probability is $1 - O(1/n^3)$ that at least some small constant fraction $q$ of the $z$ cycles are good. By picking $k_1 = 4r/q$, which implies $z \ge 4r\lambda/q$, at least $qz\,\mathrm{cap}(c)/8r \ge \frac{1}{2}|C|$ messages will be delivered. We bound the probability that at least $(1 - q)z$ of the $z$ cycles are bad by using a counting argument. There are $\binom{z}{(1-q)z}$ ways of picking the bad cycles, and the probability that a cycle is bad is at most $1 - 1/8r$. Thus, the probability that at most $\frac{1}{2}|C|$ messages are delivered is

$$\Pr\{\le \tfrac{1}{2}|C| \text{ messages delivered}\} \le \binom{z}{(1-q)z} \left( 1 - \frac{1}{8r} \right)^{(1-q)z} \le e^{-z/12r}$$
if  we  choose  q  =  l/e'’rlnr,  as  the  reader  may  verify. 
Since $z = \max\{k_1\lambda, k_2 \lg n\}$, if we choose $k_2 = 36r$, the probability that fewer than $\frac{1}{2}|C|$ messages are delivered is at most $1/n^3$. ∎

Now we can analyze RANDOM as a whole.

Theorem 7 For any message set $M$ on an $n$-processor fat-tree, the probability is at least $1 - O(1/n)$ that RANDOM will deliver all the messages of $M$ within the number of delivery cycles specified by Figure 2.

Proof. First, we will show that if $\lambda_{\mathrm{guess}} \ge \lambda(M)$, the probability is at most $O(1/n)$ that the loop in lines 2 through 8 of TRY-GUESS fails to yield $\lambda(U) \le 1$. Initially, $\lambda \ge \lambda(U)$, and we know from Lemma 6 that the probability is at most $O(1/n^2)$ that any given iteration of the loop fails to restore this condition as $\lambda$ is halved. Since there are $\lg \lambda_{\mathrm{guess}}$ iterations of the loop, we need only make the reasonable assumption that $\lambda_{\mathrm{guess}}$ is polynomial in $n$ to obtain a probability of at most $O(1/n)$ that $\lambda(U)$ remains greater than 1 after all the iterations of the loop.

Now we just need to count the number of delivery cycles which will have been completed by the time we call TRY-GUESS with a $\lambda_{\mathrm{guess}}$ such that $\lambda(M) \le \lambda_{\mathrm{guess}}$. Let us denote by $\lambda^*_{\mathrm{guess}}$ the first $\lambda_{\mathrm{guess}}$ which satisfies this condition, and then break the analysis down into cases according to the value of $\lambda(M)$.

For $\lambda(M) \le 1$, we do not actually even call TRY-GUESS. We need only count the one delivery cycle executed in line 1 of RANDOM.

For $1 < \lambda(M) \le 2$, we need add only the $O(\lg n)$ cycles executed when we call TRY-GUESS(2).

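For $2 < \lambda(M) \le \lg n$, the number of delivery cycles involved in each execution of TRY-GUESS is $O(\lg \lambda_{\mathrm{guess}} \cdot k_2 \lg n)$, since we perform $O(\lg \lambda_{\mathrm{guess}})$ iterations of the loop in lines 2-8 of TRY-GUESS, each containing $k_2 \lg n$ iterations of the loop in lines 3-6. The value of $\lambda^*_{\mathrm{guess}}$ is at most $\lambda(M)^2$, so the total number of delivery cycles is

$$O(\lg n \lg \lambda(M)^2) + O(\lg n \lg \lambda(M)) + O(\lg n \lg \sqrt{\lambda(M)}) + O(\lg n \lg \sqrt[4]{\lambda(M)}) + \cdots + O(\lg n)$$
$$= \sum_{0 \le i \le 1 + \lg\lg \lambda(M)} O\left(\lg n \lg\left(\lambda(M)^{2^{1-i}}\right)\right)$$
$$= \sum_{0 \le i \le 1 + \lg\lg \lambda(M)} O\left(2^{1-i} \lg n \lg \lambda(M)\right)$$
$$= O(\lg n \lg \lambda(M)) ,$$

since the series is geometric.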

For $\lambda(M) > (k_2/k_1) \lg n$, the number of delivery cycles executed by the time we reach line 8 of RANDOM is $O(\lg n \lg\lg n)$ according to the preceding analysis, and then we must continue in the quest to reach $\lambda^*_{\mathrm{guess}}$. If $\lambda(M) \le (k_2/k_1) \lg n \lg\lg n$, then we need only add the number of delivery cycles involved in the single call TRY-GUESS($(k_2/k_1) \lg n \lg\lg n$). This additional number of delivery cycles is also $O(\lg n \lg\lg n)$, which is $O(\lg n \lg \lambda(M))$.

If $\lambda(M) > (k_2/k_1) \lg n \lg\lg n$, the number of delivery cycles executed before reaching line 8 is $O(\lg n \lg\lg n)$ as before, which is $O(\lambda(M))$. We must then add $O(\lambda_{\mathrm{guess}})$ cycles for each call of TRY-GUESS in line 10. Since $\lambda^*_{\mathrm{guess}}$ is at most $2\lambda(M)$, the total additional number of delivery cycles is

$$O(2\lambda(M)) + O(\lambda(M)) + O(\lambda(M)/2) + \cdots + O(\lg n \lg\lg n)$$
$$= \sum_{0 \le i \le t} O\left(2^{1-i} \lambda(M)\right)$$
$$= O(\lambda(M)) ,$$

where $t = 1 + \lg(k_1 \lambda(M) / k_2 \lg n \lg\lg n)$. The total number of delivery cycles is thus $O(\lambda(M))$. ∎
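The case analysis can be replayed by simply counting the cycles that the guessing schedule spends. The sketch below tabulates the guesses RANDOM would try before its guess first reaches the true load factor, together with the cycle count of each TRY-GUESS call taken from the loop bounds in Figure 4. It optimistically assumes the first sufficiently large guess succeeds, sets `lg_n` to lg n with lg n at least 2, and uses placeholder constants; the function names are hypothetical.

```python
import math


def try_guess_cycles(guess, lg_n, k1=1.0, k2=1.0):
    """Delivery cycles used by one full TRY-GUESS call with this guess."""
    cycles, lam = 0, float(guess)
    while lam > 1:
        cycles += math.ceil(max(k1 * lam, k2 * lg_n))
        lam /= 2
    return cycles + 1  # the final "send U"


def guess_schedule(lam_true, lg_n, k1=1.0, k2=1.0):
    """Guesses tried until the first one that is at least lam_true."""
    if lg_n < 2:
        raise ValueError("this sketch needs lg n >= 2")
    guesses, g = [], 2.0
    while k1 * g < k2 * lg_n:  # first phase: squaring
        guesses.append(g)
        if g >= lam_true:
            return guesses
        g = g * g
    g = (k2 / k1) * lg_n * math.log2(lg_n)  # line 8 of RANDOM
    while True:  # second phase: doubling
        guesses.append(g)
        if g >= lam_true:
            return guesses
        g *= 2


def total_cycles(lam_true, lg_n, k1=1.0, k2=1.0):
    """One initial send plus all TRY-GUESS calls in the schedule."""
    return 1 + sum(try_guess_cycles(g, lg_n, k1, k2)
                   for g in guess_schedule(lam_true, lg_n, k1, k2))
```

Comparing `total_cycles(lam, lg_n)` with `try_guess_cycles(lam, lg_n)` for a range of load factors shows the overhead of not knowing the load factor in advance, which the proof bounds by a constant factor.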

The $1 - O(1/n)$ bound on the probability that RANDOM delivers all the messages can be improved to $1 - O(1/n^k)$ for any constant $k$ by choosing $k_2 = 12(k + 2)r$, or by simply running the algorithm through more choices of $\lambda_{\mathrm{guess}}$.

We can also use RANDOM to obtain a routing algorithm which guarantees to deliver all the messages in finite time with expected number of delivery cycles given in Figure 2. We simply interleave RANDOM with any routing strategy that guarantees to deliver at least one message in each delivery cycle. If the number of messages is bounded by some polynomial $n^k$, then we choose $k_2$ such that RANDOM works with probability $1 - O(1/n^k)$.

5 Further results

This section contains additional results relevant to routing on fat-trees. We first present an improved version of the universality theorem from [8]. Then we give two results on fat-tree routing that follow from the analysis of RANDOM. Finally, we show that there are message sets on which a greedy routing strategy fails to work well.

Universality 

The performance of the routing algorithm RANDOM
allows us to generalize the universality theorem from [8],
which states that a universal fat-tree of a given volume
can simulate any other routing network of equal volume
with only a polylog factor increase in the time required.
The original proof assumed the simulation of the routing
was off-line. Our results show that the simulation can
be carried out in the more interesting on-line context.


Theorem 8 Let FT be a universal fat-tree of
volume v on a set of n processors, and let R
be an arbitrary routing network also of volume
v on a set of n processors. Then there
is an identification of processors in FT with
the processors of R with the following property.
Any message set M that can be delivered
in time t by R can be delivered by FT in time
O((t + lg lg n) lg^3 n) with probability 1 − O(1/n).

Sketch of proof. The proof follows that of [8], to which the
reader is referred for details. The routing network
R of volume v is mapped to FT in such a way that any
message set M that can be delivered in time t by R
puts a load factor of at most O(t lg(n/v^(2/3))) on FT.
By Theorem 7, the message set M can be delivered by
RANDOM in O(t lg(n/v^(2/3)) + lg n lg lg n) delivery cycles
with high probability. Since each delivery cycle takes at
most O(lg^2 n) time, the result follows. ∎
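
The arithmetic behind the final step can be spelled out as follows. This is a sketch that assumes the volume satisfies v ≥ 1, so that lg(n/v^(2/3)) ≤ lg n; that assumption is made here for illustration and is not stated in the proof sketch itself.

\[
\bigl(t \lg(n/v^{2/3}) + \lg n \lg\lg n\bigr) \cdot O(\lg^2 n)
  \;\le\; \bigl(t \lg n + \lg n \lg\lg n\bigr) \cdot O(\lg^2 n)
  \;=\; O\bigl((t + \lg\lg n)\lg^3 n\bigr).
\]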

Remark. The delivery cycle time of the off-line fat-trees
presented in [8] is Θ(lg n). The on-line fat-trees
described in Section 1 have a basic delivery cycle time
of Θ(lg^2 n) because the concentrator switches have
logarithmic depth. We have discovered a simpler on-line
fat-tree with delivery cycle time of Θ(lg n), but
unfortunately, the number of delivery cycles required by a
RANDOM-like algorithm is increased by a factor of lg n.
It seems reasonable to look for fat-tree structures which
save the factor of lg n in delivery cycle time without
displacing it elsewhere.

Off-line  routing 

Our analysis for RANDOM has repercussions for the
off-line routing case. Since we have shown that with
high probability, the number of delivery cycles given by
Figure 2 suffices to deliver a message set with load factor
λ, there must exist off-line schedules using this many
delivery cycles, which improves the bound of O(λ lg n)
given in [8]. The previous off-line bound was proved by
deterministically constructing a routing schedule that
achieves the bound. Our better bound does not yield a
deterministic construction of the routing schedule, but
it does yield a probabilistic one.

Larger  channel  capacities 

We can improve the results for on-line routing if each
channel c in the fat-tree is sufficiently large, that is, if
cap(c) = Ω(lg n). Specifically, we can deliver a message
set M in O(λ(M)) delivery cycles with high probability,
i.e., we can meet the lower bound to within a constant
factor. The better bound is achieved by the algorithm
RANDOM-2 shown in Figure 5.

2  while M ≠ ∅ do
3      for each message m ∈ M, choose a random
       number i_m ∈ {1, 2, ..., z}
4      for i ← 1 to z do
5          send all messages m such that i_m = i
6      endfor
7      z ← 2z
8  endwhile

Figure 5: Algorithm RANDOM-2 for routing in a fat-tree with
channels of capacity Ω(lg n).
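
The loop of Figure 5 can be sketched in Python under a deliberately simplified model. Everything below is an assumption made for illustration rather than part of the paper: the fat-tree is taken to be a complete binary tree stored heap-style, `cap` is a hypothetical function giving the capacity of the channel above a node, every message crossing a congested channel is treated as lost (the pessimistic rule used in the analysis of RANDOM), and the initial value `z0` of z is a parameter because the first line of the figure is not reproduced above.

```python
import random
from collections import Counter


def path_channels(src, dst, n):
    """Directed channels crossed by a message from leaf src to leaf dst.

    The tree is a complete binary tree stored heap-style: root = 1 and
    leaf p (0-based) = n + p. ("up", v) and ("down", v) name the two
    directions of the channel between node v and its parent.
    """
    u, v = n + src, n + dst
    channels = []
    while u != v:                      # climb both ends to their common ancestor
        channels.append(("up", u))
        channels.append(("down", v))
        u //= 2
        v //= 2
    return channels


def delivery_cycle(batch, n, cap):
    """One delivery cycle; returns a delivered/lost flag per message.

    Pessimistic assumption: every message that crosses a congested
    channel (load above capacity) is lost.
    """
    load = Counter()
    for src, dst in batch:
        load.update(path_channels(src, dst, n))
    return [all(load[ch] <= cap(ch[1]) for ch in path_channels(src, dst, n))
            for src, dst in batch]


def random_2(messages, n, cap, z0=1, rng=random):
    """Route messages as in Figure 5; returns the number of delivery cycles.

    z0 is an assumed initial value of z, not taken from the figure.
    """
    pending = list(messages)
    z, cycles = z0, 0
    while pending:
        slots = [rng.randint(1, z) for _ in pending]   # i_m for each message
        undelivered = []
        for i in range(1, z + 1):
            batch = [m for m, s in zip(pending, slots) if s == i]
            ok = delivery_cycle(batch, n, cap)
            undelivered += [m for m, good in zip(batch, ok) if not good]
            cycles += 1
        pending = undelivered
        z *= 2                                         # line 7: z <- 2z
    return cycles


if __name__ == "__main__":
    n = 16
    uniform_cap = lambda v: 4          # every channel has capacity lg n = 4
    msgs = [(p, p + 8) for p in range(5)]
    print(random_2(msgs, n, uniform_cap, rng=random.Random(0)))
```

Because each pass doubles z, the cycles spent in all earlier passes sum to less than the length of the final pass, which is the observation the proof of Theorem 9 relies on when it ignores the earlier cycles.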

Theorem 9 For any message set M on an
n-processor fat-tree with channels of capacity
Ω(lg n), the probability is at least 1 − O(1/n)
that RANDOM-2 will deliver all the messages
of M in O(λ(M)) delivery cycles, if λ(M) is
polynomially bounded.

Proof. Let the lower bound on channel size be a lg n, and
let n^k be the polynomial bound on the load factor λ(M).
We consider only the pass of the algorithm when z first
exceeds e 2^((k+2)/a) λ(M). We ignore previous cycles for
the analysis of message routing, except to note that the
number of delivery cycles they require is O(λ(M)).

We first consider a single channel c within a single
cycle i from among the z delivery cycles in the pass.
Since each message has probability 1/z of being sent in
cycle i, we can apply Lemma 4 with p = 1/z to conclude
that the probability that channel c is congested in cycle
i is at most

\[
2^{-(k+2)\lg n} = \frac{1}{n^{k+2}}.
\]

Since there are O(n) channels, the probability that there
exists a congested channel in cycle i is O(1/n^(k+1)).
Finally, since there are z ≤ 2e 2^((k+2)/a) λ(M) = O(λ(M)) =
O(n^k) cycles, the probability is O(1/n) that there exists
a congested channel in any delivery cycle of the pass. ∎

Greedy  strategies 

We have shown that there are no message sets on which
RANDOM fails to work well. It is natural to wonder
whether a simple greedy strategy of sending all undelivered
messages on each delivery cycle, and letting them
battle their way through the switches, might be as
effective. We can show that no greedy strategy works as
well as RANDOM. Specifically, for any λ > 1, there
is a message set with load factor λ which causes the
greedy strategy to take Ω(λ lg n) delivery cycles on an
n-processor fat-tree. This lower-bound result is based
on an idea originally due to Miller Maley of MIT.

1  while M ≠ ∅ do
2      send M
3      M ← M − {messages delivered}
4  endwhile

Figure 6: Algorithm GREEDY for delivering a message set
M.

Figure 6 shows the greedy algorithm. The code for
GREEDY does not completely specify the behavior of
message routing on a fat-tree because the switches have
a choice as to which messages to drop when there is
congestion. (The processors also have this choice, but
we shall think of them as being switches as well.) In
the analysis of RANDOM, we could presume that all
messages in a channel were lost if the channel was
congested. To completely specify the behavior of GREEDY,
we must define the behavior of switches when channels
are congested.

The  lower  bound  for  GREEDY  covers  a  wide  range  of 
switch  behaviors.  Specifically,  we  assume  the  switches 
have  the  following  properties. 

1. Each switch is greedy in that it only drops messages
if a channel is congested, and then only the minimum
number necessary.

2. Each switch is oblivious in that decisions on which
messages to drop are not based on any knowledge of
the message set other than the presence or absence
of messages on the switch's input lines.

We define the switches of a fat-tree to be admissible if
they have these two properties. The conditions are
satisfied, for example, by switches that drop excess messages
at random, or by switches that favor one input channel
over another. An admissible switch can even base
decisions on its previous decisions, but it cannot predict
the future or make decisions based on knowing what (or
how many) messages it or other switches have dropped.
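
The two example policies mentioned above can be sketched in Python on a toy model. The names `admissible_switch` and `greedy_cycles` are hypothetical, and the model (one switch feeding one output channel, with each message waiting on its own input line) is an assumption made for illustration; it is far simpler than a fat-tree and does not reproduce the lower-bound construction.

```python
import random


def admissible_switch(present, capacity, rng=None):
    """Choose which occupied input lines are forwarded in one cycle.

    present  -- list of booleans, one per input line (True = message present)
    capacity -- messages the output channel can carry per delivery cycle
    Greedy: exactly max(0, occupied - capacity) messages are dropped.
    Oblivious: only the presence bits are inspected, never message contents.
    With rng=None the switch favors low-numbered lines; otherwise it
    drops the excess messages at random.
    """
    occupied = [i for i, p in enumerate(present) if p]
    if len(occupied) <= capacity:
        return occupied                      # no congestion, nothing dropped
    if rng is None:
        return occupied[:capacity]           # favor one input over another
    return sorted(rng.sample(occupied, capacity))


def greedy_cycles(num_messages, capacity, rng=None):
    """Delivery cycles GREEDY needs when all messages share one channel."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    waiting = [True] * num_messages
    cycles = 0
    while any(waiting):                      # while M is nonempty
        for line in admissible_switch(waiting, capacity, rng):
            waiting[line] = False            # M <- M - {messages delivered}
        cycles += 1
    return cycles
```

In this one-channel model both policies need the same number of cycles, since a greedy switch always fills the channel; the difference between policies only matters in a multi-level network, which is where the adversary argument of the lower bound operates.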

Theorem 10 Consider an n-processor fat-tree
with admissible switches, where the channel
capacities grow at a rate α in the range
1 < α < 2. Then for any λ > 1, there exists a
message set with load factor λ on which GREEDY
requires Ω(λ lg n) delivery cycles.

Sketch of proof. We use an adversary argument and
construct a message set in which all messages are directed
out the root. We first specify that the root switch is
congested for Ω(λ) delivery cycles and demand to know
what decisions the switch has made. Of the two subtrees
of the root switch, we call the one which provides more
than half the delivered messages the favored subtree. If
both supply the same number, we pick one arbitrarily
to be favored.

We then recursively design a message set for the
unfavored side that has load factor λ, and put as many
messages as possible on the favored side without
exceeding a load factor of λ for the entire fat-tree. We
design the message set in such a way as to be consistent
with our specification that the root switch be congested
for Ω(λ) delivery cycles. The crux of the construction
is to ensure that as we go down the fat-tree following
unfavored sides of switches, the messages delivered
earlier will not uncongest the switches lower down. At each
level of the fat-tree, we show that Ω(λ) delivery cycles
are required. ∎

6  Concluding  remarks 

The analysis of the algorithm RANDOM gives reasonably
tight asymptotic bounds on its performance, but
the constant factors in the analysis are large. In practice,
smaller constants probably suffice, but it is difficult
to simulate the algorithm to determine what constants
might be better. Unlike Valiant's algorithm for routing
on the hypercube, our algorithm does not have the
same probabilistic behavior on all sets of messages, and
therefore, the simulation results may be highly correlated
with the specific message sets chosen. The search
for good constants is thus a multidimensional search in
a large space, where each data point represents an
expensive simulation.

The idea of using load factors to analyze arbitrary
networks is a natural one. We have been successful in
analyzing fat-trees using this measure of routing
difficulty. It seems unlikely that large parallel supercomputers
will only need to route permutations, but rather,
they will need some distributed means to break apart
their message sets into routable permutations. We
expect that analysis in terms of load factor can be applied
to other networks with positive results.